er. The mapping function is shown below, where $f(\mathbf{X})$ is a
and $\mathbb{L}$ is a label vector for four countries:
$\mathbb{L} = f(\mathbf{X})$
the k-mers were ranked using a supervised machine learning
examine which k-mers dominated the difference of the genomic
between countries.
h, how the genomic patterns evolved through time in these
was investigated. This requires a regression analysis model and
it is shown below, where $\mathbb{T}$ stands for the time lag between the
on date of each sequence and the date of the first occurrence of
$\mathbb{T} = f(\mathbf{X})$
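The regression target is one time lag per sequence. As a minimal sketch, assuming the lag is measured in whole days and that the reference date defaults to the earliest date in the data set (both are assumptions, since the text does not fix them here), the target vector could be built as follows. The function name `time_lags` and the sample dates are hypothetical.

```python
from datetime import date

def time_lags(sequence_dates, first_occurrence=None):
    """Return the lag in days between each sequence date and a reference date.

    If no reference date is given, the earliest date in the input is used
    (an assumption made for this illustration only).
    """
    if first_occurrence is None:
        first_occurrence = min(sequence_dates)
    return [(d - first_occurrence).days for d in sequence_dates]

# Toy input: three made-up dates, not taken from the data set in the text.
dates = [date(2020, 3, 1), date(2020, 1, 20), date(2020, 12, 31)]
lags = time_lags(dates)
print(lags)
```

Each entry of `lags` would then be paired with the feature row of the same sequence when fitting whichever regression model is chosen.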
Genomics distribution of sequences
A self-organising map (SOM) [Kohonen, 1982] model with 900 neurons
was constructed for the dual-normalised 3-mer data set of these 58,897
sequences from four countries. Figure 7.18 shows the SOM generated
with the kohonen package. The map shows a clear pattern: the
sequences from the four countries are mapped to almost distinct areas.
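A map like this is usually summarised by counting how many cells receive at least one sequence and how many receive sequences from only one country. The following minimal Python sketch illustrates that bookkeeping, assuming the winning cell of each sequence is already known. The function name `som_cell_summary`, its arguments and the toy data are hypothetical and are not taken from the kohonen package.

```python
from collections import Counter, defaultdict

def som_cell_summary(winners, labels, n_cells):
    """Summarise cell occupancy and purity of a trained map.

    winners[i] is the index of the cell that sequence i maps to,
    labels[i] is the country of sequence i, n_cells is the map size.
    """
    per_cell = defaultdict(Counter)
    for cell, country in zip(winners, labels):
        per_cell[cell][country] += 1

    occupied = len(per_cell)
    pure_by_country = Counter()
    for counts in per_cell.values():
        if len(counts) == 1:  # only one country seen in this cell
            (country,) = counts
            pure_by_country[country] += 1

    pure = sum(pure_by_country.values())
    return {
        "occupied": occupied,
        "empty": n_cells - occupied,
        "occupancy_rate": occupied / n_cells,
        "pure": pure,
        "purity_rate": pure / occupied if occupied else 0.0,
        "pure_by_country": dict(pure_by_country),
    }

# Toy input: six made-up sequences on a five-cell map.
winners = [0, 0, 1, 1, 2, 3]
labels = ["USA", "USA", "India", "USA", "Brazil", "Russia"]
summary = som_cell_summary(winners, labels, n_cells=5)
print(summary)
```

In this sketch the purity rate is taken over occupied cells rather than over all cells, so empty cells do not lower it.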
In the map, eight cells were empty, with no sequence mapped to them. The
proportion of occupied cells was therefore 99.11% (892 of 900). Among the 892 occupied cells, 839 cells
contained sequences from a single country, such as USA, India,
Russia or Brazil. This indicates that 94.06% of the occupied cells were pure for the
sequences of one country. This means that the decomposition of the 58,897
sequences into four subsets ($f(\mathbf{X}) \Rightarrow \bigcup \{\Omega_{USA}, \Omega_{India}, \Omega_{Brazil}, \Omega_{Russia}\}$)
was successful and that this set of sequences did have an intrinsic, significant
genome pattern that discriminates the virus genomes from different
countries. Among the pure cells, 711 neurons were pure for USA, 66 neurons were
pure for India, 17 neurons were pure for Russia and 45 neurons were pure
for Brazil. Each of the 711 neurons only witnessed USA sequences, each of
the 66 neurons only witnessed India sequences, etc. Some neurons (or cells or
ve evidenced the overlap of sequences from different countries. This